CRIM 515 Final Project Template

02 December 2024

My Name

The Final Project follows the same general flow as the Research Projects, as outlined below. Use this format as much or as little as you’d like, as long as the required components are covered.

1. Research Question

The goal of this project is to forecast future calls for service activity in the City of Fairfax, Virginia for calendar year 2025.

These instructions will demonstrate how to forecast using an ARIMA model. An ARIMA model is an Autoregressive Integrated Moving Average, which is a univariate model predicting future values of a single variable over time. Let’s breakdown these terms:

Your research question should consider behavior, space, and time. You’ll either measure or control for each of these.

For example, using 2008-2024 data, forecast trespass calls for service counts and locations in the City of Fairfax, VA for each month in 2025.

2. Data

Acquire

  1. Load the libraries you need for success:
library(tidyverse)
library(leaflet)
library(sf)
library(tigris)
library(zoo)
library(aTSA)
library(tseries)
library(forecast)
  1. Get the latest and greatest Fairfax calls for service data:
calls.full <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                               "1ti7BCMe1vvDVVr75p6lqvgouCy_ae7rn"))

Wrangle

  1. Format the date, as a date, and add month columns (different styles) for later:
calls.full$DATE <- as.Date(calls.full$date, format = "%Y-%m-%d")
calls.full$MONTHS <- months(calls.full$DATE)
calls.full$MONTH <- month(calls.full$DATE)
  1. Determine the total count and percentage of each call type:
sum.calls <- calls.full %>%
  group_by(type) %>%
  summarise(call.count = n()) %>%
  mutate(pct = round(call.count/sum(call.count)*100,2))
  1. Use the table from #4 to find and filter the call types to forecast (choose your own!):
group <- c("SUSPICIOUS", "MISC")
calls.subset <- calls.full %>% filter(calls.full$type %in% group)

3. Methods

Building a relevant methodology is largely dependent on your research question and data. Any good methodology involves multiple steps for acquiring (done above), wrangling (done above), calculating, visualizing, and analyzing data. There is no single right answer here; rather, its the ability to craft a unique, distinct workflow that results in innovative work.

Some important considerations for how you use your data include:

To use an ARIMA model, we will identify values for p, d, and q:

Calculate

  1. Calculate calls per day:
calls <- calls.subset %>%
  group_by(DATE) %>%
  summarise(CALL.COUNT = n())
  1. Fill in blank days:
calls <- calls %>%
  complete(DATE = seq.Date(min(DATE), max(DATE), by = "day")) %>%
  mutate(WEEKDAY = lubridate::wday(DATE, label = T, week_start = 1), 
         MONTH = lubridate::month(DATE, label = T, abbr = F),
         WEEK = isoweek(DATE),
         DAY = day(DATE),
         YEAR = year(DATE))
# replace the NAs with 0s
calls <- replace(calls, is.na(calls), 0)
  1. Change the data type to a proper time series, update the sequencing:
cleaned.calls <- zoo(calls$CALL.COUNT, 
                       seq(from = as.Date(min(calls$DATE)), 
                           to = as.Date(max(calls$DATE)), by = 1))
  1. Generate basic summary stats and a basic graph:
summary(cleaned.calls)
##      Index            cleaned.calls   
##  Min.   :2007-01-01   Min.   : 0.000  
##  1st Qu.:2011-06-17   1st Qu.: 4.000  
##  Median :2015-12-01   Median : 6.000  
##  Mean   :2015-12-01   Mean   : 6.186  
##  3rd Qu.:2020-05-16   3rd Qu.: 8.000  
##  Max.   :2024-10-31   Max.   :37.000
plot(cleaned.calls)
title("My Calls Per Day") # this adds a title to your graph

  1. Make the data stationary (normalized, as a deviation from previous values), and a new basic graph. You can do more/higher differences as necessary (but probably don’t need to). Examine the graphs and compare the spikes. These are just examples:
stationary1 <- diff(cleaned.calls, differences = 1)
plot(stationary1)

stationary2 <- diff(cleaned.calls, differences = 2)
plot(stationary2)

stationary3 <- diff(cleaned.calls, differences = 3)
plot(stationary3)

Look for the data with the narrowest y axis range, as consistently as close to 0 as possible. In this example, stationary1 is the best dataset.

  1. Conduct an Augmented Dickey-Fuller test to determine if the data is stationary. The resultant score should be a negative number, and the lower the number the stronger the data is stationary. In this example, ‘cleaned.calls’ is run as a comparison to data that is not differenced.
adf.test(as.matrix(cleaned.calls))
## Warning in adf.test(as.matrix(cleaned.calls)): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  as.matrix(cleaned.calls)
## Dickey-Fuller = -8.6457, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary1))
## Warning in adf.test(as.matrix(stationary1)): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  as.matrix(stationary1)
## Dickey-Fuller = -30.532, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary2))
## Warning in adf.test(as.matrix(stationary2)): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  as.matrix(stationary2)
## Dickey-Fuller = -41.14, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary3))
## Warning in adf.test(as.matrix(stationary3)): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  as.matrix(stationary3)
## Dickey-Fuller = -49.399, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary

In this example, all scenarios are negative. When combined with the plot analysis in Step 5, stationary1 is best. Ideally you want to manipulate the data as little as possible. While ‘stationary2’ and ‘stationary3’ are even more negative, they are not negative enough to warrant choosing them.

  1. Lagged autocorrelations - specifically using the Autocorrelation (ACF) and Partial Autocorrelation (PACF) functions - will help determine the p and q values. For both calculations and visuals, you will use the best stationary dataset you chose in Steps 5 and 6 above.
pacf(stationary1)

pacf(stationary1, pl=FALSE)
## 
## Partial autocorrelations of series 'stationary1', by lag
## 
##      1      2      3      4      5      6      7      8      9     10     11 
## -0.452 -0.347 -0.242 -0.217 -0.170 -0.118 -0.092 -0.113 -0.094 -0.066 -0.050 
##     12     13     14     15     16     17     18     19     20     21     22 
## -0.088 -0.149 -0.057 -0.018 -0.031 -0.032 -0.057 -0.046 -0.029 -0.051 -0.040 
##     23     24     25     26     27     28     29     30     31     32     33 
## -0.026  0.019 -0.038 -0.048 -0.080 -0.030  0.009 -0.004 -0.025 -0.036 -0.028 
##     34     35     36     37     38 
## -0.015 -0.022 -0.037 -0.009  0.017
acf(stationary1)

acf(stationary1, pl=FALSE)
## 
## Autocorrelations of series 'stationary1', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000 -0.452 -0.072  0.031 -0.019  0.016  0.015 -0.009 -0.025  0.018  0.011 
##     11     12     13     14     15     16     17     18     19     20     21 
## -0.004 -0.030 -0.021  0.082 -0.011 -0.034  0.004 -0.015  0.020  0.011 -0.024 
##     22     23     24     25     26     27     28     29     30     31     32 
##  0.008  0.008  0.021 -0.052  0.010 -0.010  0.048  0.007 -0.037 -0.010  0.004 
##     33     34     35     36     37     38 
##  0.018  0.007 -0.017 -0.009  0.028  0.008

In this example, the p should be 8, and the q should be 1.

  1. Build an ARIMA function using the inputs derived from previous steps (based on analysis from the graph in Step 4, it’s reasonable to assume our data has season trends - so we’ll set it to TRUE):
arima.calls <- auto.arima(cleaned.calls, d = 1, max.p = 8, max.q = 1, seasonal = T)
summary(arima.calls)
## Series: cleaned.calls 
## ARIMA(2,1,1) 
## 
## Coefficients:
##          ar1      ar2      ma1
##       0.0600  -0.0350  -0.9519
## s.e.  0.0132   0.0132   0.0049
## 
## sigma^2 = 8.431:  log likelihood = -16183.71
## AIC=32375.41   AICc=32375.42   BIC=32402.54
## 
## Training set error measures:
##                        ME    RMSE      MAE  MPE MAPE      MASE         ACF1
## Training set -0.007592036 2.90269 2.228652 -Inf  Inf 0.7497139 0.0006917396

Note that the ‘auto.arima’ function runs a series of ARIMA models and choose the best fit for the data. The best fitting model can be interpreted as ARIMA(d, p, q).

  1. Do a quick residuals checks on the model to see how it performed:
checkresiduals(arima.calls)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)
## Q* = 14.957, df = 7, p-value = 0.03655
## 
## Model df: 3.   Total lags used: 10

Proper residuals derived from your model include:

  1. Use the ARIMA function to do a sample forecast for the next week of data:
# h = the number of units of time to measure
forecast.7days <- forecast(arima.calls, h=7)
round(sum(forecast.7days$upper[,2]),0)
## [1] 74
forecast.7days$mean
## Time Series:
## Start = 20028 
## End = 20034 
## Frequency = 1 
## [1] 4.763555 4.819397 4.831025 4.829767 4.829285 4.829300 4.829318
  1. Identify the number of days between the last date in your dataset and the end of 2025:
forecast.window <- as.numeric(as.Date("2025-12-31")-max(calls$DATE))
  1. Forecast the number of calls per day for everyday in that window, and plot it:
forecast.2025 <- forecast(arima.calls, h=forecast.window)
autoplot(forecast.2025)

  1. Extract the forecasted values as a table, clean up the column names, add the forecast date, and month:
forecast.values <- as.data.frame(forecast.2025$mean)
forecast.values$ID <- seq.int(nrow(forecast.values))
forecast.upper <- as.data.frame(forecast.2025$upper)
forecast.upper$ID <- seq.int(nrow(forecast.upper))
forecast.values <- forecast.values %>%
  left_join(forecast.upper, by = 'ID')
colnames(forecast.values) <- c("MEAN", "ID", "CI80", "CI90")
forecast.values$DATE <- as.Date(max(calls$DATE) + forecast.values$ID)
forecast.values$MONTH <- months(forecast.values$DATE)
  1. Filter to 2025, summarize forecasts by month:
forecast.values.2025 <- subset(forecast.values, forecast.values$DATE > '2024-12-31')
forecast.months <- forecast.values.2025 %>%
  group_by(MONTH) %>%
  summarise(MEAN = round(sum(MEAN),0), FORECAST.90 = round(sum(CI90),0), FORECAST.80 = round(sum(CI80),0))
forecast.months$DIFF <- forecast.months$FORECAST.90 - forecast.months$FORECAST.80

Visualize

  1. Graph of calls with a trend line:
graph.calls <- ggplot(calls, aes(x=DATE, y=CALL.COUNT)) + 
  geom_line() + 
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") + 
  xlab("Years") + 
  ylab("Call Count") + 
  ggtitle("TITLE HERE") + 
  geom_area(fill="lightblue", color="black")

graph.calls + 
  geom_smooth(method = lm, col = "red", se = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  1. Graph actual calls with predicted calls:
calls2024 <- calls[c(1,2)]
calls2025 <- forecast.values[c(5,1)]
names(calls2025) <- c("DATE", "CALL.COUNT")

new.calls <- rbind(calls2024, calls2025)

graph.new.calls <- ggplot(new.calls, aes(x=DATE, y=CALL.COUNT)) + 
  geom_line() + 
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") + 
  xlab("Years") + 
  ylab("Call Count") + 
  ggtitle("TITLE HERE") + 
  geom_area(fill="lightblue", color="black")

graph.new.calls + 
  geom_smooth(method = lm, col = "red", se = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'

  1. Tweak the data table for making a graph:
forecast.months$MONTH <- factor(forecast.months$MONTH, levels = forecast.months$MONTH)
forecast.long <- pivot_longer(forecast.months, cols = c(FORECAST.80, DIFF), 
                          names_to = "Category", values_to = "Value")
forecast.long$Category <- gsub("DIFF", "90% Confidence Interval", forecast.long$Category)
forecast.long$Category <- gsub("FORECAST.80", "80% Confidence Interval", forecast.long$Category)
  1. Graph your monthly forecasts, first for the upper bounds, and then for the mean:
ggplot(forecast.long, aes(x = MONTH, y = Value, fill = fct_rev(Category))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Value), size = 3, colour = 'white', position = position_stack(vjust = 0.5)) + 
  labs(title = "City of Fairfax 2025 Monthly Forecast Upper Bounds", 
       x = "Month", 
       y = "Call Count") +
  scale_fill_manual(values = c("90% Confidence Interval" = "blue",
                               "80% Confidence Interval" = "grey"),
                    name = "Forecasts") +
  scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August",
                              "September", "October", "November", "December")) + 
  coord_cartesian(ylim = c(0, 500)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(forecast.months, aes(x = MONTH, y = MEAN)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August",
                              "September", "October", "November", "December")) +
  coord_cartesian(ylim = c(100, 175)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "City of Fairfax 2025 Mean Monthly Forecast", 
       x = "Month", 
       y = "Call Count")

  1. Map your monthly hotspots.
fairfax.roads <- roads("VA", "Fairfax city")
fairfax.outline <- county_subdivisions("VA", "Fairfax city")
ggplot() +
  geom_sf(data = fairfax.outline, color = "grey") +
  geom_hex(aes(x = lon, y = lat), data = calls.subset, bins = 8) +
  scale_fill_continuous(type = "viridis") +
  geom_sf(data = fairfax.roads, color = "black", alpha = 0.4) +
  theme_classic() +
  facet_wrap(~ MONTH)

4. Findings

INSTRUCTIONS

Writing Tips

Related, some general notes about writing. Welcome to grad school, where the quality of your writing is held to a higher standard. Write directly. Thus, avoid using these words and phrases:

Also, words have meanings. Some are stronger than others. Others aren’t real.

Further, here’s some other strategies to consider:

And finally, this is just fun.

Other Information/Reminders

  1. When submitting your projects, you are submitting either an HTML or PDF output of your Markdown file, as an upload to Canvas - not via email.
  2. When you are ready to see what the output looks like, click the ‘Knit’ button. The output file will be saved to the same folder this script is saved to.
  3. Never ever ever include any install.packages functions. Either comment them out with a #, or delete them.

Grading: 30pts

1. Research Question

2pt

Provide a brief paragraph (1-2 sentences) on the research question. State the question, in terms of behavior, space, and time. Define key terms and concepts, to include how variables will be measured.

2. Literature Review

2pts

Provide a brief paragraph describing a prior study related to your current project. This study should be from a peer-academic academic journal. Include details about the research question, data, methods, and findings. Cite the article in APA format.

3. Data

1pt

Provide a brief paragraph on the data used in this study. This includes the specific source(s), and the specific temporal and spatial constraints. Address any major advantages and disadvantages with using these data compared to other sources. Be specific enough that a reader could replicate this work.

4. Methods

2pts

Provide a brief paragraph on the research methods leveraged to answer this question. This includes the software, calculations, skills, techniques, and unique workflows used to analyze your data and develop an answer. You do NOT need to describe click-by-click instructions or lines of code describing how you did things; you DO need to describe a logical process that is specific enough that a reader could replicate.

5. Findings

12pts

Provide one brief paragraph (4-5 sentences) per month forecasted (2pts each). Each paragraph should include forecasted counts, and specific areas of the city to focus on. Your findings should be descriptive, clear, and comprehensively answer the original question.

7. Visuals

10pts

8. Formatting

1pt

ONE LAST THING… the source code for creating this file can be found here.

Please email me with any questions.